Bayes estimator

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

Definition

Suppose an unknown parameter θ is known to have a prior distribution $\pi$ . Let $\delta = \delta(x)$ be an estimator of θ (based on some measurements x), and let $L(\theta,\delta)$ be a loss function, such as squared error. The Bayes risk of $\delta$ is defined as $E_\pi \{ L(\theta, \delta) \}$ , where the expectation is taken over the probability distribution of $\theta$ : this defines the risk function as a function of $\delta$ . An estimator $\delta$ is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss $E \{ L(\theta,\delta) | x \}$ for each x also minimizes the Bayes risk and therefore is a Bayes estimator.^[1]
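The equivalence between minimizing the Bayes risk and minimizing the posterior expected loss for each x can be checked by brute force on a tiny discrete problem. The sketch below is only an illustration: the two-point prior, the likelihood table and the 0-1 loss are made-up values, and the Bayes risk is computed as the expected loss over the joint distribution of θ and x.

```python
from itertools import product

# Hypothetical two-point problem: theta and x both take values in {0, 1}.
prior = {0: 0.7, 1: 0.3}                 # pi(theta), illustrative numbers
likelihood = {0: {0: 0.9, 1: 0.1},       # p(x | theta = 0)
              1: {0: 0.2, 1: 0.8}}       # p(x | theta = 1)
actions = [0, 1]
observations = (0, 1)


def loss(theta, action):
    """0-1 loss, chosen only to keep the example small."""
    return 0.0 if theta == action else 1.0


def bayes_risk(rule):
    """Expected loss over the joint distribution of (theta, x); rule maps x to an action."""
    return sum(prior[t] * likelihood[t][x] * loss(t, rule[x])
               for t in prior for x in observations)


def posterior_optimal_rule():
    """For each x, pick the action with the smallest posterior expected loss."""
    rule = {}
    for x in observations:
        # The normalizer p(x) is omitted because it does not change the argmin.
        rule[x] = min(actions,
                      key=lambda a: sum(prior[t] * likelihood[t][x] * loss(t, a)
                                        for t in prior))
    return rule


# Enumerate all four decision rules and keep the one with the smallest Bayes risk.
all_rules = [dict(zip(observations, choice)) for choice in product(actions, repeat=2)]
best_rule = min(all_rules, key=bayes_risk)
pointwise_rule = posterior_optimal_rule()
```

In this toy problem the rule found by exhaustive search over the Bayes risk coincides with the rule built one observation at a time from the posterior expected loss.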

If the prior is improper then an estimator which minimizes the posterior expected loss for each x is called a generalized Bayes estimator.^[2]

Examples

Minimum mean square error estimation

Main article: Minimum mean square error

The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by

$\mathrm{MSE} = E\left[ (\widehat{\theta}(x) - \theta)^2 \right],$

where the expectation is taken over the joint distribution of $\theta$ and $x$ .

Posterior mean

Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,

$\widehat{\theta}(x) = E[\theta |x]=\int \theta \pi(\theta |x)\,d\theta.$

This is known as the minimum mean square error (MMSE) estimator. The posterior expected loss, in this case, is the posterior variance, and the Bayes risk is the expectation of the posterior variance over x.
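As a quick numerical check, the sketch below takes a small discrete posterior (the grid of θ values and the probabilities are made-up illustrative numbers) and searches over candidate estimates for the one with the smallest posterior expected squared error. Under these assumptions the search lands on the posterior mean, and the minimal expected loss equals the posterior variance.

```python
# Illustrative discrete posterior over theta (values are assumptions, not data).
thetas = [0.0, 1.0, 2.0, 3.0]
posterior = [0.1, 0.4, 0.3, 0.2]   # probabilities, summing to 1


def posterior_expected_sq_loss(estimate):
    """Posterior expected squared-error loss of a candidate estimate."""
    return sum(p * (t - estimate) ** 2 for t, p in zip(thetas, posterior))


posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
posterior_variance = posterior_expected_sq_loss(posterior_mean)

# Brute-force search over a fine grid of candidate estimates in [0, 3].
candidates = [i / 1000 for i in range(0, 3001)]
best_candidate = min(candidates, key=posterior_expected_sq_loss)
```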

Bayes estimators for conjugate priors

Main article: Conjugate prior

If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
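The sequential use of a conjugate prior can be sketched as follows, assuming a normal prior on θ and normal measurements with known variance (the function names and the sample values are illustrative). Each posterior is fed back in as the next prior, and the result agrees with processing all measurements in one batch.

```python
def normal_update(prior_mean, prior_var, x, noise_var):
    """Posterior mean and variance of theta after one observation x ~ N(theta, noise_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var


def sequential_estimate(prior_mean, prior_var, xs, noise_var):
    """Process measurements one at a time, reusing each posterior as the next prior."""
    mean, var = prior_mean, prior_var
    for x in xs:
        mean, var = normal_update(mean, var, x, noise_var)
    return mean, var


def batch_estimate(prior_mean, prior_var, xs, noise_var):
    """Process all measurements at once using the closed-form normal posterior."""
    n = len(xs)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(xs) / noise_var)
    return post_mean, post_var


measurements = [1.2, 0.7, 1.9]          # illustrative data
seq_mean, seq_var = sequential_estimate(0.0, 4.0, measurements, 1.0)
bat_mean, bat_var = batch_estimate(0.0, 4.0, measurements, 1.0)
```

Because the posterior stays in the normal family, only two numbers (mean and variance) need to be carried from one step to the next.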

Following are some examples of conjugate priors.

If x|θ is normal, x|θ ~ N(θ,σ²), and the prior is normal, θ ~ N(μ,τ²), then the posterior is also normal and the Bayes estimator under MSE is given by

$\widehat{\theta}(x)=\frac{\sigma^{2}}{\sigma^{2}+\tau^{2}}\mu+\frac{\tau^{2}}{\sigma^{2}+\tau^{2}}x.$

If x₁,...,x_n are iid Poisson random variables x_i|θ ~ P(θ), and if the prior is Gamma distributed θ ~ G(a,b), then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by

$\widehat{\theta}(X)=\frac{n\overline{X}+a}{n+\frac{1}{b}}.$

If x₁,...,x_n are iid uniformly distributed x_i|θ~U(0,θ), and if the prior is Pareto distributed θ~Pa(θ₀,a), then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by

$\widehat{\theta}(X)=\frac{(a+n)\max{(\theta_0,x_1,\ldots,x_n)}}{a+n-1}.$
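The three closed-form estimators above translate directly into code. The following is a minimal sketch with illustrative function names; it assumes that the Gamma prior G(a,b) is parameterized by shape a and scale b (the reading consistent with the formula above) and that a+n>1 in the Pareto case so that the posterior mean exists.

```python
def normal_normal_estimate(x, mu, tau2, sigma2):
    """Posterior mean for x | theta ~ N(theta, sigma2) with prior theta ~ N(mu, tau2)."""
    return sigma2 / (sigma2 + tau2) * mu + tau2 / (sigma2 + tau2) * x


def poisson_gamma_estimate(xs, a, b):
    """Posterior mean for iid Poisson(theta) data with a Gamma(shape=a, scale=b) prior."""
    n = len(xs)
    return (sum(xs) + a) / (n + 1.0 / b)


def uniform_pareto_estimate(xs, theta0, a):
    """Posterior mean for iid U(0, theta) data with a Pareto(scale=theta0, shape=a) prior."""
    n = len(xs)
    if a + n <= 1:
        raise ValueError("posterior mean requires a + n > 1")
    return (a + n) * max(theta0, *xs) / (a + n - 1)


# Illustrative inputs.
est_normal = normal_normal_estimate(2.0, mu=0.0, tau2=1.0, sigma2=1.0)
est_poisson = poisson_gamma_estimate([2, 3, 4], a=1.0, b=1.0)
est_uniform = uniform_pareto_estimate([0.5, 2.0], theta0=1.0, a=1.0)
```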

Alternative risk functions

Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by $F$ .

Posterior median and other quantiles

A "linear" loss function, with $a>0$ , which yields the posterior median as the Bayes estimate:

$L(\theta,\widehat{\theta}) = a|\theta-\widehat{\theta}|$

$F(\widehat{\theta }(x)|X) = \tfrac{1}{2}.$

Another "linear" loss function, which assigns different "weights" $a,b>0$ to underestimation and overestimation, respectively. It yields a quantile of the posterior distribution, and is a generalization of the previous loss function:

$L(\theta,\widehat{\theta}) = \begin{cases} a|\theta-\widehat{\theta}|, & \mbox{for }\theta-\widehat{\theta} \ge 0 \\ b|\theta-\widehat{\theta}|, & \mbox{for }\theta-\widehat{\theta} < 0 \end{cases}$

$F(\widehat{\theta }(x)|X) = \frac{a}{a+b}.$
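The quantile result can be illustrated numerically. The sketch below assumes a made-up discrete posterior with equal mass on the values 1 to 10 and compares a brute-force minimizer of the asymmetric linear loss with the posterior quantile at level a/(a+b); the function names are illustrative.

```python
def asymmetric_linear_loss(theta, estimate, a, b):
    """a*|theta - estimate| if theta >= estimate, b*|theta - estimate| otherwise."""
    diff = theta - estimate
    return a * diff if diff >= 0 else -b * diff


def bayes_estimate(thetas, probs, a, b, candidates):
    """Candidate with the smallest posterior expected asymmetric linear loss."""
    def expected_loss(estimate):
        return sum(p * asymmetric_linear_loss(t, estimate, a, b)
                   for t, p in zip(thetas, probs))
    return min(candidates, key=expected_loss)


def posterior_quantile(thetas, probs, level):
    """Smallest support point whose cumulative posterior probability reaches `level`."""
    cumulative = 0.0
    for t, p in sorted(zip(thetas, probs)):
        cumulative += p
        if cumulative >= level - 1e-12:
            return t
    return max(thetas)


# Illustrative posterior: equal mass on 1, 2, ..., 10.
thetas = [float(i) for i in range(1, 11)]
probs = [0.1] * 10
candidates = [i / 10 for i in range(10, 101)]

estimate_high = bayes_estimate(thetas, probs, a=3.0, b=1.0, candidates=candidates)
estimate_low = bayes_estimate(thetas, probs, a=1.0, b=3.0, candidates=candidates)
```

With a=3 and b=1 underestimation is three times as costly as overestimation, so the estimate moves up to the 0.75 quantile; swapping the weights moves it down to the 0.25 quantile.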

Posterior mode

The following loss function is trickier: it yields either the posterior mode or a point close to it, depending on the curvature and other properties of the posterior distribution. Small values of the parameter $K>0$ are recommended, so that the resulting estimate approximates the mode (here $L>0$ is a constant penalty):

$L(\theta,\widehat{\theta}) = \begin{cases} 0, & \mbox{for }|\theta-\widehat{\theta}| < K \\ L, & \mbox{for }|\theta-\widehat{\theta}| \ge K. \end{cases}$
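To see why this loss leads toward the mode, it may help to write out the posterior expected loss. As a sketch, using only the definition of the loss above:

$E\{L(\theta,\widehat{\theta})\,|\,x\} = L \cdot P(|\theta-\widehat{\theta}| \ge K \,|\, x) = L\left[1 - P(|\theta-\widehat{\theta}| < K \,|\, x)\right].$

Minimizing this expression is the same as maximizing the posterior probability of the interval $(\widehat{\theta}-K,\widehat{\theta}+K)$ . If the posterior has a continuous density $\pi(\theta|x)$ , that probability is approximately $2K\,\pi(\widehat{\theta}|x)$ for small $K$ , which is largest at the posterior mode.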

Other loss functions can be conceived, although the mean squared error is the most widely used and validated.

Generalized Bayes estimators

The prior distribution $\pi$ has thus far been assumed to be a true probability distribution, in that

$\int \pi(\theta) d\theta = 1.$

However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, R, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function $\pi(\theta) = 1$ , but this would not be a proper probability distribution since it has infinite mass,

$\int{\pi(\theta)d\theta}=\infty.$

Such measures $\pi(\theta)$ , which are not probability distributions, are referred to as improper priors.

The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution

$\pi(\theta|x) = \frac{p(x|\theta) \pi(\theta)}{\int p(x|\theta) \pi(\theta) d\theta}.$

This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss

$\int{L(\theta,a)\pi(\theta|x)d\theta}$

is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.^[2]

Example

A typical example concerns the estimation of a location parameter with a loss function of the type $L(a-\theta)$ . Here $\theta$ is a location parameter, i.e., $p(x|\theta) = f(x-\theta)$ .

It is common to use the improper prior $\pi(\theta)=1$ in this case, especially when no other more subjective information is available. This yields

$\pi(\theta|x) = \frac{p(x|\theta) \pi(\theta)}{p(x)} = \frac{f(x-\theta)}{p(x)}$

so the posterior expected loss equals

$E[L(a-\theta)] = \int{L(a-\theta) \pi(\theta|x) d\theta} = \frac{1}{p(x)} \int L(a-\theta) f(x-\theta) d\theta.$

The generalized Bayes estimator is the value $a(x)$ which minimizes this expression for all $x$ . This is equivalent to minimizing

$\int L(a-\theta) f(x-\theta) d\theta$ for all $x.$ (1)

It can be shown that, in this case, the generalized Bayes estimator has the form $x+a_0$ , for some constant $a_0$ . To see this, let $a_0$ be the value minimizing (1) when $x=0$ . Then, given a different value $x_1$ , we must minimize

$\int L(a-\theta) f(x_1-\theta) d\theta = \int L(a-x_1-\theta') f(-\theta') d\theta'.$ (2)

This is identical to (1) with $x=0$ , except that $a$ has been replaced by $a-x_1$ . Thus, the minimizing value satisfies $a-x_1 = a_0$ , so that the optimal estimator has the form

$a(x) = a_0 + x.$
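The shift property can also be observed numerically. The sketch below assumes a standard normal density for f, a flat improper prior, and a hypothetical asymmetric absolute-error loss; it minimizes expression (1) on a grid for two values of x and compares the offsets a(x) - x. The grid sizes are arbitrary choices, so the result is accurate only up to the grid resolution.

```python
import math


def normal_pdf(z):
    """Standard normal density, used here as the location family f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def loss(u):
    """Hypothetical loss L(a - theta): underestimation (u < 0) costs twice as much."""
    return -2.0 * u if u < 0 else u


def generalized_bayes_estimate(x, step=0.02, half_width=10.0, search=2.0):
    """Grid minimizer of the integral of L(a - theta) f(x - theta) over theta."""
    n = int(round(2 * half_width / step))
    thetas = [-half_width + i * step for i in range(n + 1)]
    weights = [normal_pdf(x - t) for t in thetas]   # f(x - theta) with pi(theta) = 1
    k_max = int(round(search / step))
    candidates = [x + k * step for k in range(-k_max, k_max + 1)]

    def objective(a):
        # Riemann-sum approximation; the constant factor `step` does not affect the argmin.
        return sum(loss(a - t) * w for t, w in zip(thetas, weights))

    return min(candidates, key=objective)


a_at_0 = generalized_bayes_estimate(0.0)     # plays the role of a_0
a_at_x1 = generalized_bayes_estimate(1.5)    # estimate for x_1 = 1.5
offset_at_x1 = a_at_x1 - 1.5
```

Up to the grid resolution, the offset found at x = 1.5 matches the value found at x = 0, in line with the form a(x) = a_0 + x.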

Empirical Bayes estimators

Main article: Empirical Bayes method

A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.

There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data.^[3]

Example

The following is a simple example of parametric empirical Bayes estimation. Given past observations $x_1,\ldots,x_n$ having conditional distribution $f(x_i|\theta_i)$ , one is interested in estimating $\theta_{n+1}$ based on $x_{n+1}$ . Assume that the $\theta_i$ 's have a common prior $\pi$ which depends on unknown parameters. For example, suppose that $\pi$ is normal with unknown mean $\mu_\pi$ and variance $\sigma_\pi^{2}$ . We can then use the past observations to determine the mean and variance of $\pi$ in the following way.

First, we estimate the mean $\mu_m$ and variance $\sigma_m^{2}$ of the marginal distribution of $x_1, \ldots, x_n$ using the maximum likelihood approach:

$\widehat{\mu}_m=\frac{1}{n}\sum{x_i},$

$\widehat{\sigma}_m^{2}=\frac{1}{n}\sum{(x_i-\widehat{\mu}_m)^{2}}.$

Next, we use the relation

$\mu_m=E_\pi[\mu_f(\theta)] \,\!,$

$\sigma_m^{2}=E_\pi[\sigma_f^{2}(\theta)]+E_\pi[(\mu_f(\theta)-\mu_m)^{2}],$

where $\mu_f(\theta)$ and $\sigma_f^{2}(\theta)$ are the mean and variance of the conditional distribution $f(x_i|\theta_i)$ , which are assumed to be known. In particular, suppose that $\mu_f(\theta) = \theta$ and that $\sigma_f^{2}(\theta) = K$ ; we then have

$\mu_\pi=\mu_m \,\!,$

$\sigma_\pi^{2}=\sigma_m^{2}-\sigma_f^{2}=\sigma_m^{2}-K .$

Finally, we obtain the estimated moments of the prior,

$\widehat{\mu}_\pi=\widehat{\mu}_m,$

$\widehat{\sigma}_\pi^{2}=\widehat{\sigma}_m^{2}-K.$

For example, if $x_i|\theta_i \sim N(\theta_i,1)$ , and if we assume a normal prior (which is a conjugate prior in this case), we conclude that $\theta_{n+1}\sim N(\widehat{\mu}_\pi,\widehat{\sigma}_\pi^{2})$ , from which the Bayes estimator of $\theta_{n+1}$ based on $x_{n+1}$ can be calculated.
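The steps of this example can be sketched in a few lines, assuming normal observations with known variance K and a normal prior. The function names and data are illustrative, and truncating the estimated prior variance at zero is an added safeguard that is not part of the derivation above.

```python
def estimate_prior_moments(xs, noise_var):
    """Estimate the prior mean and variance from the marginal moments of past data."""
    n = len(xs)
    mu_m = sum(xs) / n                                   # marginal mean (ML estimate)
    var_m = sum((x - mu_m) ** 2 for x in xs) / n         # marginal variance (ML estimate)
    var_pi = max(var_m - noise_var, 0.0)                 # sigma_pi^2 = sigma_m^2 - K, floored at 0
    return mu_m, var_pi


def empirical_bayes_estimate(x_new, xs, noise_var=1.0):
    """Posterior mean of theta_{n+1} given x_{n+1}, using the estimated normal prior."""
    mu_pi, var_pi = estimate_prior_moments(xs, noise_var)
    weight_on_data = var_pi / (var_pi + noise_var)
    return (1.0 - weight_on_data) * mu_pi + weight_on_data * x_new


past_observations = [1.0, 3.0, 5.0, 7.0]     # illustrative data
prior_mean, prior_var = estimate_prior_moments(past_observations, 1.0)
theta_hat = empirical_bayes_estimate(9.0, past_observations, noise_var=1.0)
```

Here the marginal variance of the past observations is 5, so the estimated prior variance is 4 and the new observation receives weight 4/5 in the final estimate.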

Properties

Admissibility

Asymptotic efficiency

Let θ be an unknown random variable, and suppose that $x_1,x_2,\ldots$ are iid samples with density $f(x_i|\theta)$ . Let $\delta_n = \delta_n(x_1,\ldots,x_n)$ be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of $\delta_n$ for large n.

To this end, it is customary to regard θ as a deterministic parameter whose true value is $\theta_0$ . Under specific conditions,^[5] for large samples (large values of n), the posterior density of θ is approximately normal. In other words, for large n, the effect of the prior probability on the posterior is negligible. Moreover, if δ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:

$\sqrt{n}(\delta_n - \theta_0) \to N\left(0 , \frac{1}{I(\theta_0)}\right),$

where I(θ₀) is the Fisher information at θ₀. It follows that the Bayes estimator δ_n under MSE is asymptotically efficient.

Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.

Consider the estimator of θ based on binomial sample x~b(θ,n) where θ denotes the probability for success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(a,b), the posterior distribution is known to be B(a+x,b+n-x). Thus, the Bayes estimator under MSE is

$\delta_n(x)=E[\theta|x]=\frac{a+x}{a+b+n}.$

The MLE in this case is x/n and so we get,

$\delta_n(x)=\frac{a+b}{a+b+n}E[\theta]+\frac{n}{a+b+n}\delta_{MLE}.$

The last equation implies that, for n → ∞, the Bayes estimator (in the described problem) is close to the MLE.
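A short sketch of the binomial example follows, with illustrative function names and sample numbers. It evaluates the Bayes estimator both directly and as the weighted average of the prior mean and the MLE, and shows the estimate approaching the MLE as n grows.

```python
def beta_binomial_bayes(x, n, a, b):
    """Posterior mean of theta for x successes in n trials with a Beta(a, b) prior."""
    return (a + x) / (a + b + n)


def weighted_average_form(x, n, a, b):
    """Same estimator written as a mix of the prior mean a/(a+b) and the MLE x/n."""
    prior_mean = a / (a + b)
    mle = x / n
    prior_weight = (a + b) / (a + b + n)
    return prior_weight * prior_mean + (1.0 - prior_weight) * mle


small_sample = beta_binomial_bayes(7, 10, a=2.0, b=2.0)        # 9/14, pulled toward 0.5
large_sample = beta_binomial_bayes(700, 1000, a=2.0, b=2.0)    # close to the MLE 0.7
```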

On the other hand, when n is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that a=b; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as a+b bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(a,b) exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation d which gives the weight of prior information equal to 1/(4d²)-1 bits of new information."
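The figure $1/(4d^2)-1$ can be checked directly. The following short derivation assumes only the standard variance formula for the Beta distribution. For $\theta \sim B(a,b)$ ,

$\operatorname{Var}(\theta) = \frac{ab}{(a+b)^{2}(a+b+1)}.$

With $a=b$ the mean is 0.5 and

$d^{2} = \operatorname{Var}(\theta) = \frac{1}{4(a+b+1)}, \qquad \text{so} \qquad a+b = \frac{1}{4d^{2}} - 1.$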

Practical example of misapplication of Bayes estimators

For many years, the Internet Movie Database has used a formula for calculating the Top Rated 250 Titles which is claimed to give "a true Bayesian estimate"^[6]:

$W = \frac{Rv + Cm}{v + m}$

where:

$W$ = weighted rating

$R$ = average for the movie as a number from 0 to 10 (mean) = (Rating)

$v$ = number of votes for the movie = (votes)

$m$ = minimum votes required to be listed in the Top 250 (currently 3000)

$C$ = the mean vote across the whole report (currently 6.9)

For the Top 250, only votes from regular voters are considered.

Comparing this formula with one in the preceding section, one can see that m must have been related to the relative weight of the prior information in units of the new information given by one vote. Hence C must be the mean vote across the movies with more than 3000 votes, and m should be related to the deviation of votes in this pool.

For example, assume that a new vote brings in about 2 bits of information (one bit for above/below average, and 1 bit for "how far from average is the vote"; this assumes that votes are close to the average, but not very close). Then having m=3000 corresponds to prior information weighting 6000 bits. To illustrate the kind of prior distribution to which such a giant weight might correspond, consider again the distribution of the preceding section (it is related to a very different process of measurement, but the order of magnitude should be close); then 6000 bits correspond to d of about 1/155 of the possible range (1 to 10), or about 1/17 of a rating point.

Such a small deviation is absurd: the minimal possible deviation given integer values and an average of 6.9 is 0.3. For example, if the actual deviation is about 0.7, this corresponds to prior information weighting close to 40 bits, and to m being about 20. With such a small value for m, the formula above becomes practically indistinguishable from the common-sense formula W=R that one would expect with an entrance threshold as high as 3000 votes.
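The orders of magnitude quoted in this section can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text does, the relation weight = 1/(4d²) - 1 from the preceding section, 2 bits of information per vote, and ratings spread over a range of 9 points (1 to 10); the function and variable names are illustrative.

```python
import math


def prior_weight_bits(d):
    """Weight of the prior, in bits, for standard deviation d on the unit interval."""
    return 1.0 / (4.0 * d * d) - 1.0


def deviation_for_weight(bits):
    """Inverse relation: the standard deviation d that gives the stated prior weight."""
    return 1.0 / (2.0 * math.sqrt(bits + 1.0))


BITS_PER_VOTE = 2.0     # assumption taken from the text
RATING_RANGE = 9.0      # ratings from 1 to 10

# m = 3000 votes corresponds to 6000 bits of prior weight.
d_unit = deviation_for_weight(3000 * BITS_PER_VOTE)    # roughly 1/155 of the range
d_rating = d_unit * RATING_RANGE                        # roughly 1/17 of a rating point

# A more plausible deviation of 0.7 rating points.
bits_for_0_7 = prior_weight_bits(0.7 / RATING_RANGE)    # roughly 40 bits
m_for_0_7 = bits_for_0_7 / BITS_PER_VOTE                # roughly 20 votes
```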

Notes

^[1] Lehmann and Casella (1998), Theorem 4.1.1.
^[2] Lehmann and Casella (1998), Definition 4.2.9.
^[3] Berger (1980), section 4.5.
^[4] Lehmann and Casella (1998), Theorem 5.2.4.
^[5] Lehmann and Casella (1998), section 6.8.
^[6] IMDb Top 250.

References

Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.
Berger, James O. (1985). Statistical decision theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. ISBN 0-387-96098-8. MR 0804611.

External links

Bayesian estimation on cnx.org

